Perf bwd hip #23
base: performance
Conversation
`split_tbe_bwd.hip.cpp` needed a `.hpp` counterpart to invoke the macros that instantiate all templates for `split_tbe_bwd_hip_kernel_`.
```diff
@@ -8,6 +8,10 @@
 {% set wdesc = "weighted" if weighted else "unweighted" %}
 #include "fbgemm_gpu/embedding_backward_template_helpers.cuh"
 #include "fbgemm_gpu/split_embeddings_utils.cuh"
+#include "hip_kernel/split_tbe_common_hip.h"
```
Use `#ifdef HIP_PLATFORM_HCC` / `#endif` around the new includes.
```diff
@@ -45,6 +45,7 @@
 # An optimization for ROCm
+env.globals["items_per_warp"] = 128 if args.is_rocm is False else 256
```
If we have `use_rocm` exposed to jinja, we can probably avoid needing an extra `items_per_warp` global.
```cpp
static hipFunction_t hip_kernel_func;
{% if optimizer == "rowwise_adagrad" and not dense %}
std::set<int> D_emb_s {64, 128, 192, 256};
bool hip_opt_kernel_supported = (D_emb_s.find(max_D) != D_emb_s.end()) &&
```
When we do mixed dimension this check will go away.
IFU targets upstream commit 5219dc4.
```shell
python test/split_table_batched_embeddings_test.py SplitTableBatchedEmbeddingsTest.test_backward_adagrad_fp32_pmSUM
python test/split_table_batched_embeddings_test.py SplitTableBatchedEmbeddingsTest.test_backward_optimizers_adagrad
```
Can pass the above 2 UTs (with some modifications: the `emb_t` and `grad_t` combination is restricted to `exact` for now, and `weight_decay_mode` is adjusted in the rowwise-adagrad `test_backward_adagrad_fp32_pmSUM`).